feat(BA-2851): Add resource isolation options for multi-agent #6498
Conversation
Pull Request Overview
This PR introduces resource isolation options for multi-agent setups, enabling multiple agents to run on the same physical host with controlled resource allocation. The implementation adds three allocation modes: SHARED (default, backward compatible), AUTO_SPLIT (automatic equal division), and MANUAL (explicit per-agent configuration).
Key changes:
- Introduces a `ResourcePartitioner` class to manage resource allocation across agents
- Adds a `ResourceAllocationMode` enum with SHARED, AUTO_SPLIT, and MANUAL modes (see the sketch after this list)
- Implements validation logic to ensure consistent manual allocations across agents
- Updates agent initialization to use resource partitioning
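For concreteness, here is a minimal sketch of what the three modes mean for per-slot amounts. The enum values and the `auto_split` helper below are illustrative assumptions based on the PR description, not the PR's actual API:

```python
import enum
from decimal import Decimal
from typing import Mapping


class ResourceAllocationMode(enum.Enum):
    # Mode names follow the PR description; the string values are assumed.
    SHARED = "shared"          # every agent sees the full resource pool (previous behavior)
    AUTO_SPLIT = "auto-split"  # resources are divided equally among the agents on the host
    MANUAL = "manual"          # explicit per-agent allocations taken from the config


def auto_split(total_slots: Mapping[str, Decimal], num_agents: int) -> dict[str, Decimal]:
    """Illustrative AUTO_SPLIT behavior: give each agent an equal share of every slot."""
    return {name: amount / num_agents for name, amount in total_slots.items()}


# Two agents on a 64-core, 128 GiB host would each see half of every slot.
print(auto_split({"cpu": Decimal(64), "mem": Decimal(128)}, num_agents=2))
# -> {'cpu': Decimal('32'), 'mem': Decimal('64')}
```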
Reviewed Changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 23 comments.
| File | Description |
|---|---|
| src/ai/backend/agent/resources.py | Adds ResourcePartitioner class and changes abstract methods to raise NotImplementedError |
| src/ai/backend/agent/config/unified.py | Defines allocation modes, new config fields (allocated_cpu/mem/disk/devices), and validation logic; see the config sketch after this table |
| src/ai/backend/agent/agent.py | Integrates ResourcePartitioner into agent initialization and updates slot calculations |
| src/ai/backend/agent/server.py | Creates ResourcePartitioner instances per agent and adds resource reconciliation |
| src/ai/backend/agent/docker/agent.py | Adds resource_partitioner parameter to constructor |
| src/ai/backend/agent/kubernetes/agent.py | Adds resource_partitioner parameter to constructor |
| tests/agent/test_resource_allocation.py | Comprehensive unit tests for all three allocation modes |
| tests/agent/test_config_validation.py | Tests for config validation of allocation modes and device consistency |
| tests/agent/docker/test_agent.py | Updates test to pass ResourcePartitioner to agent |
| changes/6498.feature.md | Changelog entry |
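To make the `config/unified.py` row concrete, here is a rough Pydantic sketch of how the new fields and the MANUAL-mode validation could look. The field names come from the summary above; the types, defaults, and exact validation rules are assumptions rather than the PR's actual schema:

```python
from typing import Literal, Optional

from pydantic import BaseModel, model_validator


class AgentResourceConfig(BaseModel):
    # Field names taken from the per-file summary; types and defaults are assumed.
    allocation_mode: Literal["shared", "auto-split", "manual"] = "shared"
    allocated_cpu: Optional[int] = None            # CPU cores pinned to this agent
    allocated_mem: Optional[str] = None            # e.g. "32g"
    allocated_disk: Optional[str] = None           # e.g. "500g"
    allocated_devices: Optional[list[str]] = None  # device IDs assigned to this agent

    @model_validator(mode="after")
    def _check_manual_allocations(self) -> "AgentResourceConfig":
        # MANUAL mode must spell out its allocations; the other modes must not set them.
        manual_fields = (self.allocated_cpu, self.allocated_mem, self.allocated_disk)
        if self.allocation_mode == "manual":
            if any(v is None for v in manual_fields):
                raise ValueError("manual mode requires allocated_cpu/mem/disk to be set")
        elif any(v is not None for v in manual_fields) or self.allocated_devices is not None:
            raise ValueError("allocated_* fields are only valid in manual mode")
        return self
```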
```python
class SlotName(UserString):
    __slots__ = ("_parsed", "_device_name", "_major_type", "_minor_type")
    __match_args__ = ("device_name", "major_type", "minor_type")
```
Why was this added?
It's not strictly necessary, but I wanted to use pattern matching here (https://github.com/lablup/backend.ai/pull/6498/files#diff-a4da2a344d73525736025bcd638112245de4a7225d6293d21a7e1e5152224ec8R675), and this class needs `__match_args__` for the match statement to work.
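For context on why `__match_args__` matters: it is what allows a class pattern with positional sub-patterns to bind attributes in a `match` statement. A toy example (a simplified stand-in, not the real `SlotName`):

```python
class Slot:
    # __match_args__ maps positional sub-patterns to these attribute names.
    __match_args__ = ("device_name", "major_type", "minor_type")

    def __init__(self, device_name: str, major_type: str, minor_type: str) -> None:
        self.device_name = device_name
        self.major_type = major_type
        self.minor_type = minor_type


def describe(slot: Slot) -> str:
    match slot:
        case Slot("cuda", _, "shares"):  # positional patterns bind via __match_args__
            return "fractional GPU slot"
        case Slot(device_name, major_type, _):
            return f"{device_name} slot of type {major_type}"
    return "unknown"


print(describe(Slot("cuda", "count", "shares")))  # fractional GPU slot
print(describe(Slot("cpu", "count", "")))         # cpu slot of type count
```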
```python
async def _load_resources(self) -> Mapping[DeviceName, AbstractComputePlugin]:
    local_config_dump = self.local_config.model_dump(by_alias=True)

    match self.local_config.agent_common.backend:
        case AgentBackend.DOCKER:
            from .docker.resources import load_resources as docker_load

            return await docker_load(self.etcd, local_config_dump)
        case AgentBackend.KUBERNETES:
            from .kubernetes.resources import load_resources as kubernetes_load

            return await kubernetes_load(self.etcd, local_config_dump)
        case AgentBackend.DUMMY:
            from .dummy.config import DEFAULT_CONFIG_PATH, dummy_local_config
            from .dummy.resources import load_resources as dummy_load

            raw_config, _ = read_from_file(DEFAULT_CONFIG_PATH, "dummy")
            dummy_config = dummy_local_config.check(raw_config)
            return await dummy_load(self.etcd, local_config_dump, dummy_config)


async def _scan_available_resources(self) -> Mapping[SlotName, Decimal]:
    compute_device_types = {name: cctx.instance for name, cctx in self.computers.items()}

    match self.local_config.agent_common.backend:
        case AgentBackend.DOCKER:
            from .docker.resources import scan_available_resources as docker_scan

            return await docker_scan(compute_device_types)
        case AgentBackend.KUBERNETES:
            from .kubernetes.resources import scan_available_resources as kubernetes_scan

            return await kubernetes_scan(compute_device_types)
        case AgentBackend.DUMMY:
            from .dummy.resources import scan_available_resources as dummy_scan

            return await dummy_scan(compute_device_types)
```
In terms of extensibility, this change seems like a regression and doesn't look good.
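One way to read the extensibility concern: a hard-coded `match` over `AgentBackend` has to be edited for every new backend, whereas a registry keyed by backend name lets new backends register their own loaders. The sketch below only illustrates that alternative; it is not something proposed in the PR, and all names in it are made up:

```python
# Illustrative registry-style dispatch; the PR itself uses a match statement instead.
from typing import Awaitable, Callable, Mapping

LoaderFn = Callable[..., Awaitable[Mapping]]
_RESOURCE_LOADERS: dict[str, LoaderFn] = {}


def register_loader(backend: str) -> Callable[[LoaderFn], LoaderFn]:
    def deco(fn: LoaderFn) -> LoaderFn:
        _RESOURCE_LOADERS[backend] = fn
        return fn
    return deco


@register_loader("docker")
async def _docker_load(etcd: object, local_config: Mapping) -> Mapping:
    ...  # would delegate to .docker.resources.load_resources in real code


async def load_resources(backend: str, etcd: object, local_config: Mapping) -> Mapping:
    try:
        return await _RESOURCE_LOADERS[backend](etcd, local_config)
    except KeyError:
        raise ValueError(f"no resource loader registered for backend {backend!r}") from None
```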
This change implements configuration for partitioning resources. SHARED mode lets all agents see the full resource pool (useful for stress testing), which matches the previous behavior. AUTO_SPLIT automatically divides resources equally among agents. MANUAL mode lets users specify exact per-agent allocations for all resources. Single-agent deployments remain unaffected and retain access to all available hardware resources.
This change modifies the semantics of ResourcePartitioner so that it now takes ownership of the devices and injects partitioned devices into individual agents after initialization.
This change fixes a bug in resource splitting where reserved resources were accidentally included in each agent's total allocation. The total-slot handling was malformed: from the perspective of a single agent, the calculation of reserved resources did not properly account for the server's reserved resources. The fix inverts the condition so that reserved resources are deducted only where needed.
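With assumed numbers (not taken from the PR), the effect of the fix on an equal split looks like this:

```python
from decimal import Decimal

total_cpu = Decimal(64)    # host cores (illustrative)
reserved_cpu = Decimal(2)  # cores reserved for the server itself
num_agents = 2

buggy_share = total_cpu / num_agents                   # 32: the reserved cores get handed out too
fixed_share = (total_cpu - reserved_cpu) / num_agents  # 31: reservation deducted once, up front

print(buggy_share, fixed_share)  # 32 31
```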
Will create a new PR
resolves #6432 (BA-2851)
This change adds configuration for partitioning resources rather than every agent always seeing the full resource pool. This prevents unintended over-allocation that could crash kernels.
Single-agent deployments remain unaffected and retain access to all available hardware resources.
Checklist: (if applicable)
- Updates to integration tests in `ai.backend.test`
- Documentation contents in the `docs` directory